Target Seeking Crawlers and their Topical Performance

Authors

  • Padmini Srinivasan
  • Gautam Pant
  • Filippo Menczer
Abstract

Topic driven crawlers can complement search engines by targeting relevant portions of the Web. A topic driven crawler must exploit the information available about the topic and its underlying context. In this paper we extend our previous research on the design and evaluation of topic driven crawlers by comparing seven different crawlers on a harder problem, namely, seeking highly relevant target pages. We find that exploration is an important aspect of a crawling strategy. We also study how the performance of crawler strategies depends on a number of topical characteristics based on notions of topic generality, cohesiveness, and authoritativeness. Our results reveal that topic generality is an obstacle for most crawlers, that three crawlers tend to perform better when the target pages are clustered together, and that two of these also display better performance when topic targets are highly authoritative.

Similar resources

Defining Evaluation Methodologies for Topical Crawlers

Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks through well-defined performance measures. We have argued that general evaluation methodologies a...

Learning to Crawl: Classifier-guided Topical Crawlers

Topical or focused crawlers follow the hyperlinked structure of the Web guided by the scent of information to identify and harvest topically relevant pages. For sniffing the appropriate scent they mine the content of pages that are already fetched to prioritize the fetching of unvisited pages. Topical crawling is currently a young and creative area of research that holds the promise of benefiti...

Crawling the Web

The large size and the dynamic nature of the Web highlight the need for continuous support and updating of Web based information retrieval systems. Crawlers facilitate the process by following the hyperlinks in Web pages to automatically download a partial snapshot of the Web. While some systems rely on crawlers that exhaustively crawl the Web, others incorporate “focus” within their crawlers t...

Topical Crawling for Business Intelligence

The Web provides us with a vast resource for business intelligence. However, the large size of the Web and its dynamic nature make the task of foraging appropriate information challenging. General-purpose search engines and business portals may be used to gather some basic intelligence. Topical crawlers, driven by richer contexts, can then leverage on the basic intelligence to facilitate in-dept...

Topic-Driven Crawlers: Machine Learning Issues

Topic driven crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers. The context available to such crawlers can guide the navigation of links with the goal of efficiently locating highly relevant target pages. We developed a framework to fairly evaluate topic...

Publication date: 2002